We will compare some classifiers on the “Toxic” column.
Load libraries
library(tidyverse)
package 'tidyverse' was built under R version 4.0.5
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
-- Attaching packages --------------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.5 v purrr 0.3.4
v tibble 3.1.3 v dplyr 1.0.7
v tidyr 1.1.3 v stringr 1.4.0
v readr 2.0.1 v forcats 0.5.1
package 'ggplot2' was built under R version 4.0.5
package 'tibble' was built under R version 4.0.5
package 'tidyr' was built under R version 4.0.5
package 'purrr' was built under R version 4.0.5
package 'dplyr' was built under R version 4.0.5
package 'stringr' was built under R version 4.0.5
package 'forcats' was built under R version 4.0.5
-- Conflicts ------------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(tictoc)
package 'tictoc' was built under R version 4.0.5
library(caret)
package 'caret' was built under R version 4.0.5
Loading required package: lattice
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
library(class)
source("./parameters.R")
# Number of nearest neighbors taken into account
k = 5
# We open a relatively small Bag of Words in order to limit calculation time
fileName = "bow_tfidf__min_words_100_2grams_1000__sampling_balanced__cor_cut_0.3_from_1408_to_1110_rm0.csv"
df = read_csv(fileName, col_types=col_types_df)
df = df[,-c(2,3,5:9)]
df
KNN does not cope well with rows of the bag of words that contain only zeros: such rows are all identical, so each of them has far too many equally close neighbors. So let’s make sure there are none.
# Go through each row, return TRUE if at least one value is not zero
non_zero_rows = apply(df[,-1], 1, function(row) any(row != 0))
writeLines(paste0("Rows full of zeros: ",sum(!non_zero_rows, na.rm = TRUE)))
Rows full of zeros: 0
# Subset
df = df[non_zero_rows,]
writeLines(paste0("Remaining rows: ",dim(df)[1]))
Remaining rows: 39972
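To see why all-zero rows are a problem, here is a minimal sketch on a toy matrix. The data and the names `bow`, `d`, `zero_zero_distance` and `bow_clean` are made up for illustration and are not part of the real bag of words. Two all-zero rows are at distance 0 from each other, so with thousands of them the k nearest neighbors would be decided almost entirely by ties.

```r
# Toy bag of words (hypothetical data): rows 2 and 4 contain only zeros
bow <- matrix(
  c(0.5, 0.0, 0.2,
    0.0, 0.0, 0.0,
    0.0, 0.7, 0.0,
    0.0, 0.0, 0.0),
  nrow = 4, byrow = TRUE
)

# Same test as in the notebook: TRUE when a row has at least one non-zero value
non_zero_rows <- apply(bow, 1, function(row) any(row != 0))

# Euclidean distances between rows: the two all-zero rows are at distance 0
d <- as.matrix(dist(bow))
zero_zero_distance <- d[2, 4]

# Keep only the informative rows (drop = FALSE keeps the matrix shape)
bow_clean <- bow[non_zero_rows, , drop = FALSE]
```

In this toy case `sum(!non_zero_rows)` counts the two rows that would be dropped, which is the same count the notebook prints for the real data.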
# Split between train and test (first column: 1 = train, 2 = test), then drop that column
df_train = df[df[[1]] == 1,-1]
df_test = df[df[[1]] == 2,-1]
# Split the train set between features and labels
X_train = df_train[,-1]
Y_train = df_train$df_toxic
# Split the test set between features and labels
X_test = df_test[,-1]
Y_test = df_test$df_toxic
X_train
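The four objects built here (features and labels, for train and test) have the shape that `knn()` from the `class` package expects. Here is a minimal sketch of such a call with k = 5. It runs on hypothetical toy data rather than the real bag of words: every `toy_*` name is an assumption made for illustration, and this is not necessarily how the rest of the notebook runs the classifier.

```r
library(class)

set.seed(42)  # knn() breaks voting ties at random, so fix the seed

# Hypothetical toy data standing in for X_train / Y_train / X_test / Y_test:
# two Gaussian clouds in 2 dimensions, labelled 0 and 1
toy_X_train <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 4), ncol = 2)
)
toy_Y_train <- factor(rep(c(0, 1), each = 20))

toy_X_test <- rbind(
  matrix(rnorm(10, mean = 0), ncol = 2),
  matrix(rnorm(10, mean = 4), ncol = 2)
)
toy_Y_test <- factor(rep(c(0, 1), each = 5))

# Number of nearest neighbors taken into account
k <- 5

# Predict the label of each test row from its k nearest training rows
toy_pred <- knn(train = toy_X_train, test = toy_X_test, cl = toy_Y_train, k = k)

# Compare predictions with the true labels
toy_confusion <- table(predicted = toy_pred, actual = toy_Y_test)
toy_accuracy <- mean(toy_pred == toy_Y_test)
```

On the real data the same call would take `X_train`, `X_test` and `Y_train`, with the labels converted to a factor if they are not one already.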